ABSTRACT

A number of computer vision tasks exploit a succinct representation
of the visual content in the form of sets of local features. Given an
input image, feature extraction algorithms identify a set of keypoints
and assign to each of them a description vector, based on the
characteristics of the visual content surrounding the interest point.
Several tasks might require local features to be extracted from a
video sequence, on a frame-by-frame basis. Although temporal
downsampling has been proven to be an effective solution for mobile
augmented reality and visual search, high temporal resolution is a key
requirement for time-critical applications such as object tracking,
event recognition, pedestrian detection and surveillance. In recent
years, increasingly efficient visual feature detectors and
descriptors have been proposed. Nonetheless, such approaches are
tailored to still images. In this paper we propose a fast keypoint
detection algorithm for video sequences that exploits the temporal
coherence of the sequence of keypoints. According to the proposed
method, each frame is preprocessed so as to identify the parts of the
input frame for which keypoint detection and description need to be
performed. Our experiments show that it is possible to achieve a
reduction in computational time of up to 40%, without significantly
affecting the task accuracy.

Index Terms — Local features, keypoint detection, video. 

1. INTRODUCTION 

In recent years, ubiquitous computer vision applications have been
pervading our lives. Smartphones, self-driving terrestrial and aerial
vehicles, and Visual Sensor Networks (VSNs) are capable of acquiring
visual data and performing complex analysis tasks. In particular, VSNs
are expected to play a major role in the advent of the
Internet-of-Things paradigm. Such computer vision tasks usually exploit
a concise yet effective representation of the acquired visual content,
rather than being based on the pixel-level content. In this context,
local features represent an effective solution that is being
successfully exploited for a number of tasks such as content-based
retrieval, object tracking and image registration. Local feature
extraction algorithms usually consist of two distinct components.
First, a keypoint detector aims at identifying salient regions (e.g.,
corners, blobs) within a given image. Second, a descriptor algorithm
assigns each identified keypoint a description vector, in the form of
a set of values, based on the local characteristics of the image patch
surrounding the keypoint. Such information is further processed in
order to extract a semantic representation of the acquired content,
e.g., by identifying and tracking objects, recognizing faces,
monitoring the environment and recognizing events.

As regards visual feature extraction algorithms, SIFT [1] is
widely considered the state of the art for a large number of
tasks. It consists of a keypoint detector based on the
Difference-of-Gaussians (DoG) algorithm, and of a scale- and
rotation-invariant real-valued descriptor, based on local intensity
gradients. SURF [2] is partially inspired by SIFT and aims at
achieving a similar level of accuracy at a lower computational cost.
More recently, several low-complexity algorithms have been proposed,
with the objective of alleviating the computational burden required by
both traditional keypoint detectors and descriptors. For example,
FAST [3] and AGAST [4] are computationally efficient detectors capable
of identifying stable corners. As for descriptors, binary-valued
features are emerging as an efficient alternative to traditional
real-valued features. BRIEF [5], BRISK [6], FREAK [7] and BAMBOO [8]
are instances of such a category. For each identified keypoint, they
compute a descriptor vector in the form of a sequence of binary values,
each of which is obtained by comparing the (smoothed) intensities
of a pair of pixels sampled around the keypoint. In some cases,
ad-hoc software-based implementations are available for specific
hardware architectures [9].

Local feature detection in video sequences has been addressed in
the past literature, with the goal of identifying keypoints that are
stable across time. For example, Shi and Tomasi [10] propose a widely
adopted detector suitable for tracking applications. Zhang et al.
propose a complex video-retrieval system based on color, shape and
texture features extracted from the key-frames of a video [11]. More
recently, Zha et al. propose a method to extract spatio-temporal
features from video content [12]. Besides being a key to tasks such
as object tracking, event identification and video calibration,
temporally stable features improve the efficiency of coding
architectures tailored to features extracted from video content
[13][14][15]. Girod et al. [16] propose a feature detection and coding
algorithm inspired by traditional motion estimation methods. Such an
algorithm selects a set of features corresponding to canonical image
patches whose content is stable across frames, leading to a
significant reduction of the transmission bitrate thanks to ad-hoc
coding primitives. Although such an algorithm represents a good
solution for applications that require the efficient transmission of
local features for further processing, it might not be the best in
terms of computational complexity. On low-power devices,
computationally intensive operations might significantly reduce the
detection frame rate, possibly impairing the performance of
time-critical tasks or introducing undue delay. In this paper, we
introduce a fast detection algorithm based on BRISK [6] and tailored
to the context of video sequences, aimed at reducing the computational
complexity and thus enabling high frame rates, without significantly
affecting performance in terms of accuracy.

The rest of this paper is organized as follows. Section 2 introduces
the main concepts behind BRISK. Section 3 illustrates the proposed
fast detection architecture. Section 4 defines the experimental
setup and presents results. Finally, conclusions are drawn in
Section 5.



2. BINARY ROBUST INVARIANT SCALABLE 
KEYPOINTS (BRISK) 

Leutenegger et al. [6] propose the Binary Robust Invariant Scalable
Keypoints (BRISK) algorithm as a computationally efficient
alternative to traditional local feature detectors and descriptors.
The algorithm consists of two main steps: i) a keypoint detector, that
identifies salient points in a scale-space and ii) a keypoint
descriptor, that assigns each keypoint a rotation- and scale-invariant
binary descriptor. Each element of such a descriptor is obtained by
comparing the intensities of a given pair of pixels sampled within the
neighborhood of the keypoint at hand.

The BRISK detector is a scale-invariant version of the lightweight
FAST [3] corner detector, based on the Accelerated Segment Test
(AST). Such a test classifies a candidate point $p$ (with intensity
$I_p$) as a keypoint if $n$ contiguous pixels in the Bresenham circle
of radius 3 around $p$ are all brighter than $I_p + t$, or all darker
than $I_p - t$, with $t$ a predefined threshold. Thus, the higher the
threshold, the lower the number of keypoints that are detected, and
vice versa.
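The segment test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the optimized FAST or BRISK code: the function name `segment_test`, the default arc length `n = 9` and the list-of-lists image layout are choices made here for the example.

```python
# Offsets (dx, dy) of the 16 pixels on the Bresenham circle of radius 3,
# listed in clockwise order starting from the top.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def segment_test(img, x, y, t, n=9):
    """Return True if pixel (x, y) passes the segment test with threshold t.

    img is indexed as img[y][x]; n = 9 is an assumed arc length.
    """
    ip = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    for sign in (1, -1):  # +1: brighter arc, -1: darker arc
        flags = [sign * (v - ip) > t for v in ring]
        run = best = 0
        # Scan the ring twice so that arcs wrapping around the start count.
        for f in flags + flags:
            run = run + 1 if f else 0
            best = max(best, run)
        if min(best, len(ring)) >= n:
            return True
    return False
```

Raising `t` makes each comparison harder to satisfy, which is why fewer points pass the test at higher thresholds.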

Scale-invariance is achieved in BRISK by building a scale-space
pyramid consisting of a pre-determined number of octaves and
intra-octaves, obtained by progressively downsampling the original
image. The FAST detector is applied separately to each layer of the
scale-space pyramid, in order to identify potential regions of interest
having different sizes. Then, non-maxima suppression is applied in
a 3x3 scale-space neighborhood, retaining only features corresponding
to local maxima. Finally, a three-step interpolation process is
applied in order to refine the position of the keypoint with
sub-pixel and sub-scale precision.

3. FAST VIDEO FEATURE EXTRACTION 

Let $\mathcal{I}_n$ denote the $n$-th frame of a video sequence of
size $N_x \times N_y$, which is processed to extract a set of local
features $\mathcal{D}_n$. First, a keypoint detector is applied to
identify a set of interest points. Then, a descriptor is applied on the
(rotated) patches surrounding each keypoint. Hence, each element
$d_{n,i} \in \mathcal{D}_n$ is a visual feature, which consists of two
components: i) a 4-dimensional vector $p_{n,i} = [x, y, \sigma,
\theta]^T$, indicating the position $(x, y)$, the scale $\sigma$ of the
detected keypoint, and the orientation angle $\theta$ of the image
patch; ii) a $P$-dimensional binary vector in $\{0,1\}^P$, which
represents the descriptor associated with the keypoint $p_{n,i}$.

Traditionally, local feature extraction algorithms have been
designed to efficiently extract and describe salient points within a
single frame. Considering video sequences, a straightforward approach
consists in applying a feature extraction algorithm separately to each
frame of the video sequence at hand. However, such a method is
inefficient from a computational point of view, as the temporal
redundancy between contiguous frames is not taken into consideration.
The main idea behind our approach is to apply a keypoint detection
algorithm only on some regions of each frame. To this end, for each
frame $\mathcal{I}_n$, a binary Detection Mask $\mathcal{M}_n \in
\{0,1\}^{N_x \times N_y}$ having the same size as the input image is
computed, exploiting the information extracted from previous frames.
Such a mask defines the regions of the frame where a keypoint detector
has to be applied. That is, considering an image pixel
$\mathcal{I}_n(x, y)$, a keypoint detector is applied to such a pixel
if the corresponding mask element $\mathcal{M}_n(x, y)$ is equal
to 1. Furthermore, we assume that if a region of the $n$-th frame is
not subject to keypoint detection, the keypoints that are present in
such an area in the previous frame, i.e., $\mathcal{I}_{n-1}$, are
still valid. Hence, such keypoints are propagated to the current set of
features. That is,

$$\mathcal{D}_n = \{ d_{n,i} : \mathcal{M}_n(p_{n,i}) = 1 \} \cup \{ d_{n-1,j} : \mathcal{M}_n(p_{n-1,j}) = 0 \} \qquad (1)$$
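The propagation rule of Eq. (1) can be illustrated with a minimal Python sketch. The function name `merge_features` and the representation of a feature as an `(x, y, descriptor)` tuple are assumptions made for this example; the mask is a list of rows indexed as `mask[y][x]`.

```python
def merge_features(detected, previous, mask):
    """Combine fresh detections with features propagated from frame n-1.

    detected, previous: lists of (x, y, descriptor) tuples (assumed layout).
    mask: 2-D list of 0/1 values indexed as mask[y][x].
    """
    # Features of the current frame are kept where detection was enabled.
    kept = [f for f in detected if mask[int(f[1])][int(f[0])] == 1]
    # Features of the previous frame survive where detection was skipped.
    propagated = [f for f in previous if mask[int(f[1])][int(f[0])] == 0]
    return kept + propagated
```

Previous-frame features that fall inside the detected region are dropped, since that region is described afresh by the detector.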

Note that the algorithm used to compute the Detection Mask 
needs to be computationally efficient, so that the savings achievable 
by skipping detection in some parts of the frame are not offset by this 
extra cost. In the following, two efficient algorithms for obtaining a 
Detection Mask are proposed: Intensity Difference Detection Mask 
and Keypoint Binning Detection Mask. 

3.1. Intensity Difference Detection Mask 

The key tenet is to apply the detector only to those regions that
change significantly across the frames of the video. In order to
identify such regions and build the Detection Mask, we exploit the
scale-space pyramid built by the BRISK detector, thus incurring no
extra cost. Considering frame $\mathcal{I}_n$ and $O$ detection
octaves, pyramid layers $\mathcal{L}_{n,o}$, $o = 1, \ldots, O$, are
obtained by progressively smoothing and half-sampling the original
image, as illustrated in Section 2. Then, considering two contiguous
frames $\mathcal{I}_{n-1}$ and $\mathcal{I}_n$ and octave $o$, a
subsampled version of the Detection Mask is obtained as follows:

$$\mathcal{M}'_{n,o}(k,l) = \begin{cases} 1 & \text{if } |\mathcal{L}_{n,o}(k,l) - \mathcal{L}_{n-1,o}(k,l)| > T_I \\ 0 & \text{if } |\mathcal{L}_{n,o}(k,l) - \mathcal{L}_{n-1,o}(k,l)| \leq T_I, \end{cases} \qquad (2)$$

where $T_I$ is an arbitrarily chosen threshold and $(k, l)$ are the
coordinates of the pixels in the intermediate representation
$\mathcal{M}'_{n,o}$. Finally, the intermediate representation
$\mathcal{M}'_{n,o}$ resulting from the previous operation needs to be
upsampled in order to obtain the final mask $\mathcal{M}_n \in
\{0,1\}^{N_x \times N_y}$. Masks can then be applied to detection in
different fashions: i) using the mask obtained from each scale-space
layer $o = 1, \ldots, O$ to detect keypoints at the corresponding
layer $o$; ii) using a single detection mask for all the scale-space
layers.
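A minimal sketch of this mask construction is given below. It marks with 1 the positions whose intensity change exceeds the threshold, in line with the stated goal of running the detector only where the frame changes significantly. The function name, the nearest-neighbour upsampling and the integer `factor` between layer and frame resolution are assumptions made for the example.

```python
def intensity_difference_mask(layer_cur, layer_prev, threshold, factor):
    """Build a full-resolution detection mask from two co-located layers.

    layer_cur, layer_prev: 2-D lists of intensities for frames n and n-1.
    factor: assumed integer upsampling factor (nearest-neighbour).
    """
    # Coarse mask: 1 where the absolute intensity change exceeds the threshold.
    coarse = [[1 if abs(c - p) > threshold else 0
               for c, p in zip(row_c, row_p)]
              for row_c, row_p in zip(layer_cur, layer_prev)]
    # Upsample by repeating every element `factor` times along both axes.
    full = []
    for row in coarse:
        wide = [v for v in row for _ in range(factor)]
        full.extend([list(wide) for _ in range(factor)])
    return full
```

Computing the difference on the coarsest layer keeps the cost of the mask low, since that layer has the fewest pixels.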

3.2. Keypoint Binning Detection Mask 

Considering two contiguous frames of a video sequence, the numbers
of features identified in a given area are often correlated [17]. To
exploit such information, the detector is applied to a region of the
input image only if the number of features extracted in the co-located
region in the previous frame is greater than a threshold. Specifically,
in order to obtain a Detection Mask for the $n$-th frame, a spatial
binning process is applied to the features extracted from frame
$\mathcal{I}_{n-1}$. To this end, we define a grid consisting of
$M_r \times M_c$ spatial bins $B_{i,j}$, $i = 0, \ldots, M_r - 1$,
$j = 0, \ldots, M_c - 1$. Thus, each bin refers to a rectangular area
of $S_x \times S_y$ pixels, where $S_x = N_x / M_c$ and
$S_y = N_y / M_r$. Then, a two-dimensional spatial histogram of
keypoints is created by assigning each feature to the corresponding
bin as follows:

$$\mathcal{M}''_n(k,l) = \left| \left\{ d_{n-1,i} \in \mathcal{D}_{n-1} : \lfloor x_{n-1,i} / S_x \rfloor = k, \ \lfloor y_{n-1,i} / S_y \rfloor = l \right\} \right|, \qquad (3)$$

where $(x_{n-1,i}, y_{n-1,i})$ represents the location of feature
$d_{n-1,i}$ and $|\cdot|$ the number of elements in a set. Then, a
binary subsampled version of the Detection Mask is obtained by
thresholding such a histogram, employing a tunable threshold $T_H$:

$$\mathcal{M}'_n(k,l) = \begin{cases} 1 & \text{if } \mathcal{M}''_n(k,l) > T_H \\ 0 & \text{if } \mathcal{M}''_n(k,l) \leq T_H. \end{cases} \qquad (4)$$

Finally, the Detection Mask $\mathcal{M}_n$ having size
$N_x \times N_y$ pixels is obtained by upsampling the intermediate
representation $\mathcal{M}'_n$. Such a detection mask is applied to
all scale-space octaves.
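The binning, thresholding and upsampling steps can be sketched as follows. This is an illustration only: the function name, the argument names `bins_x` and `bins_y` (the number of bin columns and rows) and the representation of keypoints as `(x, y)` pairs are assumptions.

```python
def keypoint_binning_mask(keypoints, width, height, bins_x, bins_y, threshold):
    """Build a detection mask from the keypoint locations of frame n-1.

    keypoints: list of (x, y) locations; the mask is indexed as mask[y][x].
    """
    sx = width / bins_x    # bin width in pixels
    sy = height / bins_y   # bin height in pixels
    # Two-dimensional spatial histogram of the previous-frame keypoints.
    hist = [[0] * bins_x for _ in range(bins_y)]
    for x, y in keypoints:
        k = min(int(x // sx), bins_x - 1)
        l = min(int(y // sy), bins_y - 1)
        hist[l][k] += 1
    # Threshold the histogram and upsample it to the frame resolution.
    mask = [[0] * width for _ in range(height)]
    for yy in range(height):
        for xx in range(width):
            k = min(int(xx // sx), bins_x - 1)
            l = min(int(yy // sy), bins_y - 1)
            mask[yy][xx] = 1 if hist[l][k] > threshold else 0
    return mask
```

Only the keypoint list of the previous frame is needed, so this mask requires no pixel-level processing of the current frame.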


4. EXPERIMENTS 


Dataset: We evaluated the proposed algorithms with respect to
three different test scenarios. First, we exploited the Stanford MAR
dataset [18], containing four VGA-size, 200-frame video sequences:
Alicia Keys, Fogelberg, Anne Murray and Reba. Each sequence contains a
CD cover recorded with a hand-held mobile phone, under different
imaging conditions such as illumination, zoom, perspective, rotation
and glare. Furthermore, for each sequence, the dataset contains the
ground truth information, in the form of a still image of the
corresponding CD cover, having a size of 500 x 500 pixels.

As a second test, we evaluated the approaches resorting to the 
Rome Landmark Dataset. Such dataset includes a set of 10 query 
video sequences, each capturing a different landmark in the city 
of Rome with a camera embedded in a mobile device fill . The 
frame rate of such sequences is equal to 24 fps, whereas the
resolution ranges from 480x360 pixels (4:3) to 640x360 pixels (16:9).
The first 50 frames of each video were used as query. On average, each
query video corresponds to 9 relevant images representing the same
physical object under different conditions and with heterogeneous
qualities and resolutions. Then, distractor images randomly sampled
from the MIRFLICKR-1M dataset [19] were added, so that the final
database contains 10k images.

Finally, we tested our method on the Stanford MAR multiple object
video set [20]. Such a set is made up of 4 video sequences,
each consisting of 200 frames at 640x480 resolution. Each video
is recorded with a handheld camera and portrays three different
objects, one at a time.

Methods: We tested the two detection methods presented in
Section 3, that is, Intensity Difference Detection Mask and Keypoint
Binning Detection Mask. In both cases, we employed the original
BRISK implementation from the authors¹, setting the number of
octaves to 4 and the detection threshold to 55 and 70 for the Stanford
MAR dataset and the Rome Landmark dataset, respectively. As regards
Intensity Difference Detection Mask, we built the mask testing
several different configurations. We tested our algorithm with the 4
layers corresponding to the scale-space octaves. Since the
performance was similar when using different layers, we resorted to the
top layer, i.e., the one with the lowest spatial resolution and
processing cost. Both Intensity Difference Detection Mask and Keypoint
Binning Detection Mask require a threshold to be set in order to
obtain the final detection mask. We tested several different
configurations, each representing a tradeoff between computational
efficiency and task accuracy.

We compared our algorithms with a Temporally Coherent Detector
based on non-canonical patch matching [16], which also exploits
temporal redundancy in the detected keypoints. Such an algorithm aims
at propagating stable keypoints across frames, exploiting a
pixel-level representation of local features. In detail, a traditional
keypoint detector is applied to the first frame of a Group Of Pictures
of size $A$. Given an identified keypoint, a non-canonical square image
patch is extracted from the neighborhood of such a point. Then,
considering the following frame, a matching patch is searched for in a
window surrounding such a keypoint. Two patches are assumed to be
a match if the Sum of Absolute Differences (SAD) between their pixels
is below a given threshold $T_{bm}$. Finally, keypoints for which a
match is found are propagated to the next frame, and their position is
determined by the aforementioned block matching procedure. In our
tests, according to the prescriptions of [16], we employed patches
of 16 x 16 pixels and we set $A = 10$ and $T_{bm} = 1800$.
Furthermore, to make the procedure faster, we implemented a
coarse-to-fine matching algorithm, where the first step consists of a
spiral search algorithm with a precision of 2 pixels in a search window
of 24 x 24 pixels, whereas the second step consists of a spiral search
algorithm with quarter-pixel precision in a search window of
1.75 x 1.75 pixels. Finally, to further speed up the process, we set an
early termination SAD threshold $T_{et} = 1000$. This detector was
originally proposed with the goal of maximizing coding efficiency, when
patches around the detected keypoints need to be compressed and
transmitted. To this end, this method can also adopt more sophisticated
matching strategies, e.g., based on affine warping. However, in this
paper we consider an implementation based on block matching to
minimize the computational complexity.

¹ http://www.asl.ethz.ch/people/lestefan/personal/BRISK

Fig. 1. Accuracy, measured as the number of matches post-RANSAC
(MPR), and computational time for each frame of the Alicia Keys test
sequence.
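The core of the block matching step, namely the SAD criterion and the acceptance threshold, can be illustrated with the simplified sketch below. It deliberately replaces the coarse-to-fine spiral search with an exhaustive integer-pixel search and omits early termination; the function names and the `radius` parameter are assumptions introduced for the example.

```python
def sad(frame, x0, y0, patch):
    """Sum of Absolute Differences between patch and the block at (x0, y0)."""
    return sum(abs(frame[y0 + j][x0 + i] - patch[j][i])
               for j in range(len(patch)) for i in range(len(patch[0])))


def match_patch(frame, patch, x, y, radius, t_bm):
    """Search around the top-left corner (x, y) of the patch.

    Returns the best (x, y) position, or None when no candidate has a
    SAD below t_bm. Exhaustive integer-pixel search (simplification).
    """
    h, w = len(patch), len(patch[0])
    best_pos, best_cost = None, None
    for yy in range(max(0, y - radius), min(len(frame) - h, y + radius) + 1):
        for xx in range(max(0, x - radius), min(len(frame[0]) - w, x + radius) + 1):
            cost = sad(frame, xx, yy, patch)
            if best_cost is None or cost < best_cost:
                best_pos, best_cost = (xx, yy), cost
    if best_cost is not None and best_cost < t_bm:
        return best_pos
    return None
```

A keypoint would be propagated to the next frame only when `match_patch` returns a position.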

Evaluation methods and measures: In the case of the Stanford
MAR dataset, for a given video sequence, we extracted a set of
features for each frame. Then, the set of features extracted from a
frame is matched with the ones extracted from the ground truth frame.
A radius match algorithm is used, where the matching threshold is set
to $T_m = 0.18 \cdot 512 \approx 102$. Then, geometric coherence of
matches is enforced resorting to the RANSAC algorithm. Finally, the
number of Matches-Post-RANSAC (MPR) is employed as the accuracy
measure.
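Radius matching of binary descriptors can be sketched as follows. The example assumes that descriptors are stored as Python integers and that two descriptors match when their Hamming distance does not exceed the radius; both the storage format and the function names are assumptions, and the RANSAC verification step is not shown.

```python
def hamming(a, b):
    """Number of differing bits between two descriptors stored as integers."""
    return bin(a ^ b).count("1")


def radius_match(query, reference, radius):
    """Return all (i, j) index pairs whose Hamming distance is within radius."""
    return [(i, j)
            for i, q in enumerate(query)
            for j, r in enumerate(reference)
            if hamming(q, r) <= radius]
```

Unlike nearest-neighbour matching, a radius match may return zero or several candidates per query descriptor, which is why a geometric verification step follows.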

In the case of the Rome Landmark dataset, the accuracy of the
task was evaluated according to the Mean Average Precision (MAP).
Given an input query sequence $q$, for each frame
$\mathcal{I}_{q,n}$ it is possible to define the Average Precision as

$$AP_{q,n} = \frac{\sum_{k=1}^{Z} P_{q,n}(k) \, r_{q,n}(k)}{R_{q,n}}, \qquad (5)$$


where $P_{q,n}(k)$ is the precision (i.e., the fraction of relevant
documents retrieved) considering the top-$k$ results in the ranked
list of database images; $r_{q,n}(k)$ is an indicator function, which
is equal to 1 if the item at rank $k$ is relevant for the query, and
zero otherwise; $R_{q,n}$ is the total number of relevant documents
for frame $\mathcal{I}_{q,n}$ of the query sequence $q$ and $Z$ is the
total number of documents in the list. The overall accuracy for the
query sequence $q$ is evaluated according to

$$AP_q = \frac{\sum_{n=1}^{N} AP_{q,n}}{N}, \qquad (6)$$

where $N$ is the total number of frames of the query video $q$.
Finally, the Mean Average Precision is obtained as

$$MAP = \frac{\sum_{q=1}^{Q} AP_q}{Q}, \qquad (7)$$

that is, the mean of the $AP_q$ measure over all the $Q$ query
sequences.

Fig. 2. Energy-Accuracy curves for Stanford MAR dataset.

Fig. 3. Energy-Accuracy curve for the Rome Landmark dataset,
when using the Intensity Difference Detection Mask in order to
reduce the detection area and with different values for the
thresholding parameter. The computational time for each frame can be
reduced from 28 ms to 18 ms, without significantly affecting the
accuracy of the task.
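The retrieval measures defined in this section (per-frame Average Precision, its mean over the frames of a query, and the mean over all queries) can be computed as in the following sketch. The function names and the input layout, a list of 0/1 relevance flags per ranked list, are assumptions made for the example.

```python
def average_precision(relevance, num_relevant):
    """Average Precision of one ranked list.

    relevance: 0/1 flags of the ranked results, best match first.
    num_relevant: total number of relevant documents for the frame.
    """
    hits, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        total += (hits / k) * rel  # precision at rank k, counted on hits only
    return total / num_relevant


def mean_average_precision(queries):
    """queries: one list per query video of (relevance, num_relevant) per frame."""
    per_query = []
    for frames in queries:
        ap_frames = [average_precision(rel, r) for rel, r in frames]
        per_query.append(sum(ap_frames) / len(ap_frames))  # mean over frames
    return sum(per_query) / len(per_query)                 # mean over queries
```

For instance, a ranked list with relevance flags `[1, 0, 1]` and two relevant documents yields (1/1 + 2/3) / 2 = 5/6.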

In the case of the Stanford MAR multiple object video set, the
accuracy is measured according to a combined detection and tracking
precision metric. In particular, for each frame, the goal is to
correctly detect the portrayed database object and to identify its
position within the frame. Each frame of the video sequences is matched
against all the database objects. Radius match and geometric
verification steps are performed as in the Stanford MAR dataset
scenario. The matching object is the one with the highest number of
matches-post-RANSAC. The bounding box for the identified object
is obtained by projecting the database object corners according to
the homography computed with the RANSAC algorithm at the previous
step. Each frame is deemed correct if the correct object is
detected, and if the estimated position is consistent with the
ground-truth information. As to the latter, the estimated object
position is deemed correct if the displacement between the estimated
centroid and the ground truth one is lower than a threshold. We set the
value of such a threshold to 10 pixels.
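The position check can be sketched as below. The example assumes a 3x3 homography given as nested lists and takes the estimated centroid to be the mean of the projected corners; the function names and that centroid definition are assumptions, not details stated in the text.

```python
import math


def project(h, point):
    """Apply the 3x3 homography h (nested lists) to a 2-D point."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def position_is_correct(h, corners, gt_centroid, max_dist=10.0):
    """True if the projected-corner centroid lies within max_dist pixels."""
    cx, cy = centroid([project(h, c) for c in corners])
    return math.hypot(cx - gt_centroid[0], cy - gt_centroid[1]) < max_dist
```

With a pure 5-pixel horizontal translation, for example, the centroid of a 500 x 500 object moves from (250, 250) to (255, 250), which is within the 10-pixel tolerance.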

We evaluated the complexity of the feature extraction methods 
by means of the required CPU time. We performed our tests on a 
laptop equipped with a 2.5GHz Intel Core i5 processor and 10 GB 
of RAM. 

Results: As an illustrative example, Figure 1 shows the results
obtained for the Alicia Keys sequence. The charts also report
the results obtained when detection is performed independently on a
frame-by-frame basis (full detection) to serve as a reference. We
observe that the method using the Intensity Difference Detection Mask
(threshold 20) achieves an accuracy level similar to that of full
detection (MPR = 55 vs. 56), at a reduced computational time (20.5
ms vs. 24.5 ms). As for the Temporally Coherent Detector, it leads to
a significant loss in terms of accuracy (MPR = 41), while being
quite computationally intensive (72 ms on average). While accuracy
could be further improved by resorting to matching based on affine
warping, this would further increase its complexity. This confirms
the fact that this detector was originally designed with the goal of
maximizing coding efficiency rather than minimizing computational
cost. Since this is confirmed also on other test sequences, we do not
report additional results for this detector.

Fig. 4. Energy-Accuracy curves for the Stanford MAR multiple
object sequences.

It is interesting to observe the energy-accuracy trade-off that can
be achieved by varying the threshold used by the algorithms based on
detection masks. To this end, Figure 2 compares the performance of
Intensity Difference Detection Mask and Keypoint Binning Detection
Mask with that of full detection, averaging the results on the Stanford
MAR dataset. The two methods based on a detection mask perform
on a par, reducing the required computational time by 30% while
losing as few as 4 matches.

Furthermore, we tested our approach based on a Detection Mask
on the Rome Landmark Dataset. Figure 3 compares the results of
Intensity Difference Detection Mask with those of full detection,
showing that computational time can be reduced by about 35% without
affecting task accuracy. Furthermore, the feature extraction process
can be sped up by a factor of 3 at the cost of 0.03% lower Mean
Average Precision.
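As a rough check, assuming the per-frame times quoted in the caption of Fig. 3 (28 ms for full detection and 18 ms with the Detection Mask) are the ones behind this figure, the relative saving is

$$\frac{28 - 18}{28} \approx 0.357,$$

which is in line with the reduction of about 35% reported above.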

Finally, the results of our fast detection algorithms on the
Stanford MAR multiple object video set are reported in Figure 4. The
computational time can be reduced by up to 40% without significantly
impairing object detection and tracking performance.

5. CONCLUSIONS 

In this paper we presented a method for fast keypoint detection in
video sequences based on Detection Masks. Results show that the
proposed approach allows for a reduction in terms of computational
complexity of up to 35% without significantly impairing task
performance. In our future investigation we plan to further improve the
Detection Mask building process, by introducing more sophisticated
yet computationally efficient solutions.